1 Dataset Description

A total of 7880 individuals from 2611 families were genotyped on either the Illumina Human1M-Duov3_B or the Human1Mv1_C array.

  • 4901 males, 2979 females.
  • 2571 trios, 36 quads, 1 penta, 3 hexes.
  • 947,233 SNPs were genotyped.
  • Coordinates were based on Build36.
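The counts above can be cross-checked with a few lines of arithmetic. This is only a sketch; it assumes that trio, quad, penta and hex denote families with 3, 4, 5 and 6 genotyped members, and the variable names are illustrative.

```python
# Family structures reported above: label -> (members per family, number of families).
# Assumption: trio/quad/penta/hex mean 3/4/5/6 genotyped members.
family_types = {
    "trio": (3, 2571),
    "quad": (4, 36),
    "penta": (5, 1),
    "hex": (6, 3),
}

n_families = sum(count for _, count in family_types.values())
n_individuals = sum(size * count for size, count in family_types.values())
n_by_sex = 4901 + 2979  # males + females

print(n_families, n_individuals, n_by_sex)  # 2611 7880 7880
```

Under that assumption, the family counts, the individual total and the sex breakdown all agree.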


2 Raw Genotype QC

2.1 Sex Check

  • 141 samples flagged as PROBLEM:
    • 115 with completely missing chrX genotypes.
    • 26 with chrX-F ranging from 0.20 to 0.62.
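The flagging logic can be sketched as follows. This assumes the PLINK default chrX-F thresholds were used (F below 0.2 is called female, F above 0.8 is called male), which the report does not state. The function name and the 1 = male, 2 = female coding are illustrative.

```python
import math

def sex_check_status(reported_sex, chrx_f, female_max=0.2, male_min=0.8):
    """Return 'OK' or 'PROBLEM' for one sample.

    reported_sex: 1 = male, 2 = female.
    chrx_f: chrX inbreeding coefficient, or None/NaN when it cannot be
            computed (e.g. all chrX genotypes are missing).
    The thresholds are assumed defaults, not values taken from the report.
    """
    if chrx_f is None or math.isnan(chrx_f):
        return "PROBLEM"  # no sex can be inferred
    if chrx_f < female_max:
        inferred = 2
    elif chrx_f > male_min:
        inferred = 1
    else:
        inferred = 0  # ambiguous: F falls between the two thresholds
    return "OK" if inferred == reported_sex else "PROBLEM"
```

Under these thresholds, any F in the reported 0.20 to 0.62 range is ambiguous and is flagged whatever the reported sex.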

2.1.1 Mismatch summary



2.1.2 ChrX F distributions



2.2 Pairwise IBD estimation

  • Relationships (RT): OT (Others), FS (Full Siblings), PO (Parent Offspring)
  • Family IDs 483 and 1012 have potential issues:
    • FID:483 shows IBD sharing = 1 between IID:328 (Female) and IID:1491 (Female), where a value of ~0 was expected. Are they MZ twins, or the same individual? The genotype missing rates are 0.1185 and 0.1186 for IID:483_328 and IID:483_1491, respectively. [Drop IID:483_1491, IID:483_4371 and IID:483_993.]
    • FID:1012 shows IBD sharing = 0.57 between IID:2319 (Female) and IID:3612 (Female). They are supposed to be FS but were recruited into the same FID. The other kinships within FID:1012 can be confirmed.
  • IBD sharing for the other pairs ranges from 0.44 to 0.58 in FS, from 0.50 to 0.59 in PO, and from 0 to 0.12 in OT (which indicates inbreeding between some parents).
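One way to screen for pairs like the FID:483 pair is to compare each estimated sharing value with its pedigree expectation. The sketch below is not the procedure used for this report. The tolerance of 0.15 is a hypothetical cutoff, chosen only so that the FS, PO and OT ranges listed above pass, and the second pair in the example is made up.

```python
# Expected genome-wide proportion of IBD sharing by pedigree relationship.
EXPECTED_SHARING = {"PO": 0.5, "FS": 0.5, "OT": 0.0}

def flag_pair(rt, sharing, tolerance=0.15):
    """Return True if the estimated sharing is further than `tolerance`
    (a hypothetical cutoff) from the expectation for relationship `rt`."""
    return abs(sharing - EXPECTED_SHARING[rt]) > tolerance

# (FID, IID1, IID2, RT, estimated sharing)
pairs = [
    ("483", "328", "1491", "OT", 1.00),    # pair described above
    ("example_fam", "a", "b", "FS", 0.50),  # hypothetical pair
]
flagged = [p for p in pairs if flag_pair(p[3], p[4])]
print(flagged)
```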

2.2.1 Estimated pairwise IBD distributions



2.2.2 Family 483 & 1012



2.3 Individual genome-wide heterozygosity

2.3.1 Genome-wide heterozygosity VS missing rates



Note that samples were genotyped on either the Human1M-Duov3_B or the Human1Mv1_C. Genotypes for these individuals are a union of the genotypes from both platforms. For missingness, only the SNPs shared by the two arrays were used.
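Restricting missingness to the shared SNPs can be sketched as below. The function name, the dict-based genotype representation and the SNP IDs are illustrative assumptions, not the format used in the actual pipeline.

```python
def missing_rate_on_shared_snps(genotypes, snps_array_a, snps_array_b):
    """Per-sample missing rate computed only over SNPs present on both arrays.

    genotypes: dict mapping SNP ID -> genotype call (None = missing);
               a SNP absent from the dict is also counted as missing.
    snps_array_a, snps_array_b: iterables of SNP IDs on each array.
    """
    shared = set(snps_array_a) & set(snps_array_b)
    if not shared:
        raise ValueError("the two arrays share no SNPs")
    n_missing = sum(1 for snp in shared if genotypes.get(snp) is None)
    return n_missing / len(shared)

# Toy example: a sample typed on array A only.
array_a = ["rs1", "rs2", "rs3"]
array_b = ["rs2", "rs3", "rs4"]
sample = {"rs1": "AA", "rs2": "AG", "rs3": None}
rate = missing_rate_on_shared_snps(sample, array_a, array_b)
print(rate)  # 0.5
```

Computing the rate over the union would instead count every SNP specific to the other array as missing for this sample.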



2.3.2 Genome-wide F VS missing rates



3 Imputation

3.1 Pre-imputation

The imputation pipeline follows the one used for the SSC dataset. A total of 7769 individuals, ~784K autosomal SNPs and ~22K chrX SNPs were used for imputation.

  • filters: --geno 0.05 --mind 0.1 --maf 0.01 --hwe 1e-6
    • 111 people removed due to missing genotype data (--mind). Their missingness rates ranged from 0.7 to 1.
    • Total genotyping rate in remaining samples is 0.914029.
    • 124565 variants removed due to missing genotype data (--geno).
    • 15633 variants removed due to Hardy-Weinberg exact test.
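The two missingness filters can be sketched in plain Python. This is a toy illustration of the thresholds, not a substitute for PLINK. It assumes the sample filter runs before the variant filter, as the order of the log lines above suggests, and the data layout is hypothetical.

```python
def filter_by_missingness(calls, mind=0.1, geno=0.05):
    """calls: dict mapping sample ID -> list of genotype calls (None = missing),
    with all lists in the same variant order.
    Returns (kept sample IDs, kept variant indices)."""
    # Step 1 (--mind): drop samples whose missing rate exceeds `mind`.
    kept_samples = [
        s for s, g in calls.items()
        if sum(c is None for c in g) / len(g) <= mind
    ]
    # Step 2 (--geno): among the remaining samples, drop variants whose
    # missing rate exceeds `geno`.
    n_variants = len(next(iter(calls.values())))
    kept_variants = []
    for j in range(n_variants):
        n_missing = sum(calls[s][j] is None for s in kept_samples)
        if kept_samples and n_missing / len(kept_samples) <= geno:
            kept_variants.append(j)
    return kept_samples, kept_variants

# Toy data with looser thresholds so that the effect is visible.
toy = {
    "s1": [0, 1, 2, 0],
    "s2": [0, None, 2, 1],
    "s3": [None, None, None, None],
}
result = filter_by_missingness(toy, mind=0.5, geno=0.25)
print(result)  # (['s1', 's2'], [0, 2, 3])

n_after_mind = 7880 - 111  # 7769, matching the sample count reported above
```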


3.2 After Imputation

3.2.1 Frequency distribution

  • ~7.6M SNPs overlapped between AGP_imputed and HRC_WGS (passing filters: --geno 0.05 --maf 0.01 --hwe 1e-6).
  • Frequencies were compared based on the same allele.
  • 0 SNPs with MAF difference > 0.2
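A same-allele frequency comparison can be sketched as follows. It assumes biallelic SNPs on the same strand, so that a different allele label in the second dataset means the other allele of the same SNP. The function name, the input layout and the toy frequencies are illustrative.

```python
def large_freq_differences(freq_a, freq_b, max_diff=0.2):
    """freq_a, freq_b: dict mapping SNP ID -> (allele, frequency of that allele).
    Returns the SNP IDs whose frequency, expressed for the same allele,
    differs by more than `max_diff` between the two datasets."""
    flagged = []
    for snp, (allele_a, f_a) in freq_a.items():
        if snp not in freq_b:
            continue  # only overlapping SNPs are compared
        allele_b, f_b = freq_b[snp]
        # Assumption: biallelic SNP, so a different label is the other allele.
        if allele_b != allele_a:
            f_b = 1.0 - f_b
        if abs(f_a - f_b) > max_diff:
            flagged.append(snp)
    return flagged

# Toy frequencies, not taken from AGP_imputed or HRC_WGS.
set_a = {"rs1": ("A", 0.30), "rs2": ("C", 0.10)}
set_b = {"rs1": ("G", 0.68), "rs2": ("C", 0.45)}
flagged_snps = large_freq_differences(set_a, set_b)
print(flagged_snps)  # ['rs2']
```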



3.2.2 PCA

  • The first 3 PCs, based on pruned HapMap3 SNPs, were projected onto 1000G.
  • K-means was used to calculate distances.
  • Ancestry was assigned based on a posterior probability threshold of 0.9.
    • 6548 Europeans (EUR), 625 Americans (AMR), 123 South-Asians (SAS), 99 East-Asians (EAS) and 141 Africans (AFR).
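The final assignment step can be sketched as a simple threshold on per-population posterior probabilities. How those posteriors were derived from the K-means distances is not described here, so the sketch takes them as given. Treating 0.9 as an inclusive cutoff is an assumption, and the function name is illustrative.

```python
def assign_ancestry(posteriors, threshold=0.9):
    """posteriors: dict mapping population label -> posterior probability.
    Returns the most probable label if its probability reaches `threshold`
    (assumed inclusive), otherwise None (unassigned)."""
    label, prob = max(posteriors.items(), key=lambda kv: kv[1])
    return label if prob >= threshold else None

# Hypothetical posteriors for two samples.
print(assign_ancestry({"EUR": 0.97, "AMR": 0.03}))  # EUR
print(assign_ancestry({"EUR": 0.60, "AMR": 0.40}))  # None

# Sum of the assigned counts listed above.
n_assigned = 6548 + 625 + 123 + 99 + 141
print(n_assigned)  # 7536
```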